Domain Adaptation for Visual Understanding by Richa Singh & Mayank Vatsa & Vishal M. Patel & Nalini Ratha
Author: Richa Singh & Mayank Vatsa & Vishal M. Patel & Nalini Ratha
Language: eng
Format: epub
ISBN: 9783030306717
Publisher: Springer International Publishing
where N is the total number of video frames, and start and end are the start and end points of the local video segment. We use average pooling to aggregate the features over the time span. L2 normalization is then applied to the pooled features to rescale the vision features.
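The pooling and normalization step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `average_pool` and `l2_normalize`, the 0-based indexing with an exclusive `end`, and the toy two-dimensional frame features are all assumptions made for the example.

```python
import math


def average_pool(frames, start, end):
    """Average per-frame feature vectors over frames[start:end].

    Assumption: indices are 0-based and `end` is exclusive.
    """
    segment = frames[start:end]
    if not segment:
        raise ValueError("segment must contain at least one frame")
    dim = len(segment[0])
    return [sum(frame[d] for frame in segment) / len(segment) for d in range(dim)]


def l2_normalize(vec, eps=1e-12):
    """Rescale a vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


# Toy per-frame features (hypothetical values): N = 4 frames, 2 dimensions each.
frames = [[1.0, 0.0], [3.0, 4.0], [5.0, 8.0], [7.0, 0.0]]

# Local feature: pooled over the segment only.
local_feat = l2_normalize(average_pool(frames, 1, 3))
# Context feature: pooled over the entire video.
context_feat = l2_normalize(average_pool(frames, 0, len(frames)))
```

Both the local and the context features come from the same routine; only the time span differs.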
Feeding local video features and context video features into the model together can weakly help it learn the temporal relation between the video segment and the entire video. To model more temporal information indicating whether the video segment matches the language query, we add a temporal point that represents the time span to the video features. The temporal features are also normalized (to [0, 1]) so that they are on the same numerical scale as the video features. Finally, we concatenate the video context features, the video local features, and the temporal features to construct the input video representation.
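A small sketch of how the input representation might be assembled is given below. The text only states that the temporal features are normalized to [0, 1]; dividing the start and end indices by the number of frames is one plausible way to do that and is an assumption here, as are the function names and the toy feature values.

```python
def temporal_feature(start, end, num_frames):
    """Encode the time span as two values in [0, 1].

    Assumption: normalization divides the frame indices by the frame count.
    """
    if not 0 <= start <= end <= num_frames:
        raise ValueError("expected 0 <= start <= end <= num_frames")
    return [start / num_frames, end / num_frames]


def build_video_representation(context_feat, local_feat, start, end, num_frames):
    """Concatenate context, local, and temporal features into one vector."""
    return list(context_feat) + list(local_feat) + temporal_feature(start, end, num_frames)


# Hypothetical, already L2-normalized features for an 8-frame video.
context_feat = [0.6, 0.8]
local_feat = [1.0, 0.0]
video_rep = build_video_representation(context_feat, local_feat, 2, 6, 8)
# video_rep is [0.6, 0.8, 1.0, 0.0, 0.25, 0.75]
```

Keeping the temporal values in [0, 1] puts them in the same range as the unit-norm vision features, so neither part dominates the concatenated vector by scale alone.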
Since a video consists of a sequence of still images, knowledge learned from an image dataset can be used to learn video information. We use a model pretrained on ImageNet [10] to extract appearance features from the video dataset. Appearance information represents the objects and other attributes in still video frames. In video recognition, motion features in the form of optical flow [17] are also widely used to recognize actions. To model the motion information of videos, we use a video recognition network [25] to extract motion features. In our experiments, we construct vision features separately from the appearance and motion features. Two ensemble retrieval models are trained, one on appearance features and one on motion features, and their outputs are aggregated with late fusion.
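Late fusion combines the outputs of the two separately trained models rather than their input features. The sketch below is only one common way to do this and is not taken from the source: it assumes each model outputs a distance per candidate segment (lower means a better match) and that the two are merged with a weighted average. The names `late_fusion`, `rank_segments`, and `weight`, and the score values, are hypothetical.

```python
def late_fusion(appearance_scores, motion_scores, weight=0.5):
    """Weighted average of per-segment scores from the two models.

    Assumption: `weight` is the share given to the appearance model.
    """
    if len(appearance_scores) != len(motion_scores):
        raise ValueError("both models must score the same candidate segments")
    return [
        weight * a + (1.0 - weight) * m
        for a, m in zip(appearance_scores, motion_scores)
    ]


def rank_segments(scores):
    """Return candidate indices sorted from best to worst.

    Assumption: scores are distances, so smaller is better.
    """
    return sorted(range(len(scores)), key=lambda i: scores[i])


# Hypothetical distances for three candidate segments.
appearance_scores = [0.9, 0.2, 0.5]
motion_scores = [0.7, 0.4, 0.3]

fused = late_fusion(appearance_scores, motion_scores)
ranking = rank_segments(fused)
# ranking is [1, 2, 0]: segment 1 has the smallest fused distance
```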
The video embedding network consists of two fully connected layers with ReLU. The first fully connected layer of each video embedding network is shared to reduce the number of model parameters.
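The parameter sharing described above can be illustrated with a plain-Python sketch. Several details are assumptions rather than facts from the source: ReLU is placed after the first layer only, the layer sizes are arbitrary, weights are randomly initialized, and the class and helper names are hypothetical. No training is shown.

```python
import random


def relu(vec):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in vec]


def linear(vec, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]


def init_layer(in_dim, out_dim, rng):
    """Create (weights, bias) with small random weights and zero bias."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim


class VideoEmbeddingNet:
    """Two fully connected layers; the first is shared across networks."""

    def __init__(self, shared_layer, hidden_dim, out_dim, rng):
        self.shared_layer = shared_layer  # same object in every network
        self.own_layer = init_layer(hidden_dim, out_dim, rng)

    def forward(self, video_rep):
        hidden = relu(linear(video_rep, *self.shared_layer))
        return linear(hidden, *self.own_layer)


rng = random.Random(0)
shared = init_layer(6, 4, rng)  # created once, reused by both networks
net_a = VideoEmbeddingNet(shared, 4, 3, rng)
net_b = VideoEmbeddingNet(shared, 4, 3, rng)

video_rep = [0.6, 0.8, 1.0, 0.0, 0.25, 0.75]
emb_a = net_a.forward(video_rep)
emb_b = net_b.forward(video_rep)
```

Because both networks hold a reference to the same first-layer parameters, adding another embedding network costs only one extra second layer.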
